Prokaryotic virus host prediction with graph contrastive augmentation

Prokaryotic viruses, also known as bacteriophages, play crucial roles in regulating microbial communities and have the potential for phage therapy applications. Accurate prediction of phage-host interactions is essential for understanding the dynamics of these viruses and their impacts on bacterial populations. Numerous computational methods have been developed to tackle this challenging task. However, most existing prediction models are constrained by the large number of unknown interactions relative to the limited diversity of the available training data. To address this problem, we introduce a model for prokaryotic virus host prediction with graph contrastive augmentation (PHPGCA). Specifically, we construct a comprehensive heterogeneous graph by integrating virus-virus protein similarity and virus-host DNA sequence similarity information. As the backbone encoder for learning node representations in the virus-prokaryote graph, we employ LGCN, a state-of-the-art graph embedding technique. Additionally, we apply graph contrastive learning to augment the node representations without the need for additional labels. We further conducted two case studies aimed at predicting the host range of multi-species phages, helping to understand phage ecology and evolution.


Introduction
Prokaryotic viruses, including phages and archaeal viruses, play a crucial role in diverse ecosystems such as limnetic, marine, and soil systems [1][2][3][4]. Viruses represent the most extensive group of organisms harboring unexplored genetic diversity. The application of metagenomic techniques has facilitated a rapid surge in the identification of novel viruses. Among them, phages, which are integral components of the human microbiota, have been shown to exert influence on gut health and the development of certain diseases [5]. Furthermore, the escalating challenge of antibiotic resistance in bacteria poses a serious threat to the effective control of bacterial infectious diseases [6], and phages are also used as therapy for treating diseases caused by bacteria [7].
As viruses are unable to survive independently, investigating their hosts is crucial. Prokaryotic viruses typically have specific hosts and inject their genetic material into host cells, utilizing low-molecular substances to propagate. Despite the recognition among researchers of the significance of the interaction between prokaryotic viruses and hosts, traditional laboratory culture experiments are time-consuming and expensive [8]. More critically, less than 1% of microbial hosts have been cultivated in the laboratory [9,10]. Hence, there is a pressing need to develop computational tools for accurately identifying prokaryotic virus hosts.
In recent times, various computational approaches have been employed for host prediction, which can be broadly categorized into two groups: alignment-based methods and alignment-free methods. Alignment-based methods rely on sequence similarity between viruses and prokaryotes, as gene fragments may be shared between them. Such shared fragments include the spacer sequences of the CRISPR system [11]. These sequences are obtained from DNA fragments of viruses that have previously infected the prokaryotes. When viruses with recorded sequences attack again, the prokaryotes can employ CRISPR-based mechanisms to destroy the viral DNA and protect themselves. Thus, owing to this infection and protection mechanism, CRISPR spacers can be considered compelling evidence of virus-prokaryote interactions. However, the use of CRISPR-based evidence to identify interactions between viruses and prokaryotes is limited, as only 40%-70% of prokaryotes encode a CRISPR system [11], and many lack spacer sequences from viruses. BLAST [12] is another widely used alignment-based method for predicting the host of viruses. It identifies short, similar segments between the query and database sequences, and provides information on similarities and differences between the two sequences. Compared to the CRISPR-based approach, the BLAST-based approach generally exhibits lower accuracy but can be applied to a wider range of viruses. However, even though some sequences may have exact matches, they may fail to provide information for host prediction, such as conserved sequences around integration sites [13,14].
Alignment-free approaches are more flexible as they do not rely on direct sequence comparison. One commonly used approach is the utilization of k-mers, which are short subsequences of fixed length (k) that can be used to identify similar sequences or characterize the composition of a new sequence [15]. K-mer-based methods are computationally efficient and can be applied to a wide range of sequence data, including viruses and prokaryotes, without the need for sequence alignment. They are particularly useful for identifying similarities and patterns in large datasets and can be utilized in host prediction algorithms as a feature extraction technique. VirHostMatcher (VHM) [16], Phage-Host Interaction Search Tool (PHIST) [17] and prokaryotic virus host predictor (PHP) [18] are based on k-mer features to predict the virus hosts. VHM predicts the putative host of each input virus by leveraging similarity in oligonucleotide frequency patterns between the virus and potential hosts, and selecting the one with the smallest dissimilarity. On the other hand, PHIST predicts prokaryotic hosts of viruses by identifying exact matches between viral and host genomes using the Kmer-db tool, and PHP employs k-mer features to train a Gaussian mixture model for host prediction.
Deep learning-based approaches integrate sequence information with artificial neural networks such as convolutional neural networks (CNNs) [19] to make predictions. For example, DeepHost [20] applies a CNN architecture and treats the host prediction task as a multi-class classification problem. VHM-net [21] constructs a network containing heterogeneous features between viruses and prokaryotes and uses a Markov random field to predict the virus-host interactions. Both HostG [22] and CHERRY [23] construct a knowledge graph and apply graph convolutional neural networks (GCNs) [24] for prediction. HostG also regards the prediction task as a multi-class classification problem, while CHERRY utilizes a graph autoencoder architecture and treats the prediction task as a link prediction task, taking the host with the highest predicted score as the result.
In this study, we propose a novel method called PHPGCA, in which we approach the host prediction task as a recommendation task, aiming to recommend the host with the highest probability for specific viruses. Recognizing that traditional supervised-learning formulations may suffer from the lack of labeled interactions between viruses and hosts, we leverage the implicit information from unlabeled interactions to solve the host prediction task. Moreover, we introduce an auxiliary self-supervised task to further enhance the robustness of our model. This auxiliary task generates multiple representation views with noise and, through graph contrastive augmentation, maximizes the consistency between different views of a particular node compared to those of other nodes. To optimize both the semi-supervised and self-supervised tasks, we employ a multi-task training strategy. Our approach is compared with state-of-the-art methods, and the results demonstrate the superior performance of our model in host prediction accuracy.

Datasets
The performance of the different methods is evaluated on the CHERRY dataset [23]. We download the viruses and prokaryotes from the GitHub repository, which contains 1875 viruses. The CHERRY dataset can be split into training and testing data according to the provided raw files, encompassing 1260 positive pairs for training and 615 positive pairs for testing. The training dataset comprises viruses from 174 distinct species, whereas the testing dataset encompasses viruses from 88 distinct species. Notably, there is an overlap of 56 species between the training and testing datasets.

PHPGCA model
In the following sections, we explain our framework in detail. Our framework is divided into three distinct components. The first part entails the construction of a phage-prokaryote heterogeneous graph, while the second part uses a graph encoder to encode the embeddings of nodes. The final component of our framework is a multi-task training strategy, which trains the model on multiple tasks by combining semi-supervised learning and self-supervised learning techniques. Through this training process, we enhance the model's ability to predict the host of phages accurately. Fig 1 provides a visual depiction of the framework, serving as an illustrative guide to its underlying structure and processes.
Construction of the heterogeneous graph. As the initial step in our framework, we focus on the construction of a heterogeneous graph that incorporates both viruses and prokaryotes. This graph serves as a representation of the potential topology and relationships between viruses and prokaryotes, allowing us to capture comprehensive features. To be more precise, the heterogeneous graph $G = (O, E)$ is composed of nodes $o_i \in O$, where $i = 1, 2, \ldots, M$. Each edge between $o_i$ and $o_j$ is denoted as a tuple $(o_i, o_j) \in E$. The graph consists of two types of nodes: viral nodes $v_i \in V$ and prokaryotic nodes $p_i \in P$. The connections between viruses and prokaryotes are the main links in the virus-prokaryote heterogeneous graph. We adhere to the graph construction method outlined in CHERRY [23] to construct the heterogeneous graph. Fig 1A provides an illustration of the components involved in building the heterogeneous graph.
(1) Edge construction of virus-virus. Proteins play a pivotal role in the biological composition of organisms and serve as a crucial benchmark for assessing functional similarity among different species. Following the approach in [23] and [22], we establish connections between different viruses. Given two viruses, denoted as A and B, the probability of these viruses sharing at least c common protein clusters is calculated using Eq (1), where a and b are the numbers of proteins contained in A and B. Based on the assumption that A and B share c common protein clusters while having different hosts, the probability P is expected to be smaller than a cutoff, and following [23] we link two viruses if the probability is smaller than $\tau_1$.
(2) Edge construction of virus-prokaryote. We then construct the virus-prokaryote edges with three different types of links: CRISPR, BLAST and the established interactions derived from training data. The parameter selection aligns with the methodology described in CHERRY [23].
Eq (3) represents the formulation of edge construction for virus-prokaryote connections in the heterogeneous graph. Based on the three types of links (CRISPR-based, BLASTN-based, and known dataset-based), the edges between specific viruses and prokaryotes are created in the graph.
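As a rough illustration of the virus-virus linking rule described above, the following sketch computes the probability of two viruses sharing at least c protein clusters under a hypergeometric null model and links them when that probability falls below a cutoff. The total number of protein clusters `n`, the function names, and the example numbers are assumptions made for illustration only; the text defines a, b, c and the cutoff, and the exact form of Eq (1) should be taken from [23].

```python
from math import comb


def shared_cluster_probability(a: int, b: int, c: int, n: int) -> float:
    """Probability that viruses A and B share at least c protein clusters by chance.

    a, b: numbers of proteins in A and B, as in the text.
    c:    number of shared protein clusters.
    n:    total number of protein clusters (an assumption; not given in the text).
    """
    if c <= 0:
        return 1.0
    # P(fewer than c shared) when b clusters are drawn out of n, a of which belong to A.
    upper = min(c, b + 1)
    fewer = sum(comb(a, i) * comb(n - a, b - i) for i in range(upper))
    return 1.0 - fewer / comb(n, b)


def link_viruses(a: int, b: int, c: int, n: int, tau1: float) -> bool:
    """Link two viruses when the chance probability is below the cutoff tau1."""
    return shared_cluster_probability(a, b, c, n) < tau1


# Hypothetical numbers: 40 and 50 proteins, 12 shared clusters, 5000 clusters overall.
print(link_viruses(a=40, b=50, c=12, n=5000, tau1=1e-3))
```

Exact integer binomials from the standard library are used here so that no numerical library is required; for large cluster counts a log-space implementation may be preferable.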
Light graph neural network with contrastive augmentation. We utilize the virus-prokaryote heterogeneous graph constructed above to predict the host using the graph contrastive learning model. The main task pipeline involves parameterizing the two distinct types of nodes, viruses and prokaryotes, as embeddings. These embedding parameters are learned using traditional machine learning techniques or deep learning methods. The learned embeddings are then utilized to compute the scores through a dot product. In our proposed approach, we utilize a graph convolutional encoder to obtain embeddings for the virus and prokaryote nodes in the heterogeneous graph. During the model parameter learning process, contrastive learning aids in augmenting the available data without requiring additional labels. Subsequently, embeddings for all nodes in the graph are learned, enabling calculation of prediction scores between a specific virus and all prokaryotes.
(1) LGCN encoder. Graph neural networks (GNNs) are powerful architectures for semi-supervised learning on graph-structured data. The basic idea of GNNs is to learn the representations of nodes by aggregating the information of neighboring nodes. Almost all variants of GNNs include a neighborhood aggregation operation, in which the representation of a node v at layer k+1 is computed by an aggregation function AGGR from $h_v^{(k)}$ and the representations $h_p^{(k)}$ of its neighbors $p \in N_v$. Here $h_v^{(k)}$ and $h_p^{(k)}$ denote the representations of the nodes after k layers of propagation in the graph, and $N_v$ denotes the set of nodes that interact with node v. The AGGR function controls how the representations are utilized. For example, the most prevalent GNN model, the Graph Convolutional Network (GCN), summarizes the features of neighbors and transforms them with weight matrices and a nonlinear activation to obtain the new representation of a target node. However, not all applications require a complex GNN model, and the simplified model LightGCN (LGCN) [25] is chosen as the backbone of our model.
LGCN was initially proposed to address the challenges in recommender systems by simplifying the graph model architecture to alleviate training difficulties. In contrast to conventional recommendation tasks that commonly utilize bipartite graphs, our graph has a more intricate structure, containing virus-virus connections in addition to virus-prokaryote connections. This deliberate design choice enables us to capture a broader spectrum of relationships and interactions within the phage-host ecosystem. Light Graph Convolution (LGC) and Layer Combination are the key components of LGCN. LGC simplifies the AGGR function of the neighborhood by discarding sophisticated transformation, so that a node's embedding at the next layer is a normalized sum of its neighbors' embeddings, with the normalization terms defined as:
\[ \mathrm{Norm}^{v2v}_{vw} = \frac{1}{\sqrt{|N^{v2v}_v|}\,\sqrt{|N^{v2v}_w|}}, \qquad \mathrm{Norm}^{v2p}_{vp} = \frac{1}{\sqrt{|N^{v2p}_v|}\,\sqrt{|N^{p2v}_p|}}, \qquad \mathrm{Norm}^{p2v}_{pv} = \frac{1}{\sqrt{|N^{p2v}_p|}\,\sqrt{|N^{v2p}_v|}} \]
where v and w are viral nodes, while p represents a prokaryotic node belonging to P, and $|N|$ with the corresponding superscript and subscript is the number of neighbors of that type for the given node.
LGC abandons the operation of aggregating the target node itself. Instead of using only the embeddings from the last layer, combining the embeddings from every layer retains more information. The trainable parameters are the embeddings of viruses and prokaryotes, which serve as the embeddings at the 0-th layer. Once the initial embeddings are given, the embeddings of the subsequent layers can be computed by LGC. We obtain the final embeddings by combining the embeddings of all layers, where K denotes the number of graph convolution layers.
LGC can reduce the oversmoothing problem by simplifying the aggregation operation, while the layer combination strategy can capture semantics from different layers, making the representation more powerful. Considering the whole graph, the message-passing form extends to a matrix form. Let the virus-virus interaction matrix be $R_{vv} \in \mathbb{R}^{V \times V}$ and the virus-prokaryote interaction matrix be $R_{vp} \in \mathbb{R}^{V \times P}$, where V and P denote the numbers of viruses and prokaryotes in the heterogeneous graph. The entry of $R_{vv}$ for viruses v and w is 1 if there is a connection between them and 0 otherwise, while the entry of $R_{vp}$ is 1 if virus v interacts with prokaryote p and 0 otherwise. The adjacency matrix of the heterogeneous graph is built from these two matrices. D is the degree matrix, and $D_{ii}$ denotes the number of interactions of node i, calculated via Eq (13). In the matrix form of LGC, the embedding matrix of the 0-th layer is $H^{(0)} \in \mathbb{R}^{(V+P) \times C}$, and the final embedding matrix is computed by combining the embedding matrices of all layers. We use Xavier uniform initialization [26] to randomly initialize the embeddings of virus and prokaryote nodes and set the dimension size to 128.
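The matrix form described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the block layout of the adjacency matrix, the symmetric degree normalization, and the uniform averaging over layers follow the usual LightGCN formulation [25], and all function names are hypothetical.

```python
import numpy as np


def xavier_uniform(rows: int, cols: int, rng: np.random.Generator) -> np.ndarray:
    """Xavier/Glorot uniform initialization for a 2-D embedding table."""
    bound = np.sqrt(6.0 / (rows + cols))
    return rng.uniform(-bound, bound, size=(rows, cols))


def normalized_adjacency(R_vv: np.ndarray, R_vp: np.ndarray) -> np.ndarray:
    """Build the (V+P) x (V+P) adjacency and apply symmetric degree normalization."""
    P = R_vp.shape[1]
    # Assumed block layout: viruses first, prokaryotes second, no prokaryote-prokaryote edges.
    A = np.block([[R_vv, R_vp], [R_vp.T, np.zeros((P, P))]]).astype(float)
    degree = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    connected = degree > 0
    inv_sqrt[connected] = degree[connected] ** -0.5  # isolated nodes keep a zero factor
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]


def lightgcn_embeddings(A_norm: np.ndarray, H0: np.ndarray, num_layers: int) -> np.ndarray:
    """Propagate H0 for num_layers steps and average all layers, including layer 0."""
    layers = [H0]
    for _ in range(num_layers):
        layers.append(A_norm @ layers[-1])  # no weight matrix, no nonlinearity
    return np.mean(layers, axis=0)


# Toy graph: two linked viruses, one prokaryote infected by the first virus.
rng = np.random.default_rng(0)
R_vv = np.array([[0, 1], [1, 0]])
R_vp = np.array([[1], [0]])
H0 = xavier_uniform(3, 128, rng)
H = lightgcn_embeddings(normalized_adjacency(R_vv, R_vp), H0, num_layers=2)
print(H.shape)  # (3, 128)
```

Dense matrices keep the sketch short; a graph of realistic size would normally use a sparse adjacency matrix.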
The LGCN encoder with perturbation introduces slight differences compared to the original LGCN encoder. The perturbed version adds noise to each hidden layer, resulting in the generation of different representation views for graph contrastive augmentation. The upcoming section will provide a detailed description.
(2) Graph contrastive augmentation. Due to the issue of data sparsity in biological data, supervised learning methods may ignore much of the information present in the raw data. Contrastive learning (CL) [27] can address this limitation by extracting general representations from massive unlabeled data without requiring annotations, thereby serving as an auxiliary technique to enhance existing models and make them more robust. The core idea of CL is to augment data by leveraging representational invariances. In the case of GNN architectures, dropout of nodes or edges is commonly used to create diverse graphs and improve the generalization capability of graph models. However, applying CL by perturbing the graph structure can be time-consuming and challenging to manipulate. In our approach, inspired by [28], we focus on the embedding space for contrastive learning. Given a node v, different augmented views of the node are formed by adding noise to its embedding. The bound on the noise controls the range of the augmented embeddings, so the perturbation stays small and most of the information in the original embeddings is retained. When computing the final embeddings, we follow [28] and skip the input embeddings $H^{(0)}$ to achieve better performance. Perturbing the embeddings of each layer is easier to manipulate than dropping nodes or edges from the graph.
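One way to picture the embedding-space perturbation is the sketch below, which adds a small, sign-aligned random noise vector to every layer and averages the perturbed layers while skipping the input embeddings. The specific noise construction (uniform noise rescaled to a fixed length `eps` and aligned with the sign of the embedding) is an assumption based on how such perturbations are commonly implemented; the function name and parameters are hypothetical and the exact scheme should be taken from [28].

```python
import numpy as np


def perturbed_view(A_norm: np.ndarray, H0: np.ndarray, num_layers: int,
                   eps: float, rng: np.random.Generator) -> np.ndarray:
    """Return one noisy view of the node embeddings.

    A_norm: normalized adjacency matrix of the heterogeneous graph.
    H0:     trainable 0-th layer embeddings (one row per node).
    eps:    length of the noise vector added to each node at each layer (assumed).
    """
    H = H0
    layers = []
    for _ in range(num_layers):
        H = A_norm @ H
        noise = rng.uniform(0.0, 1.0, size=H.shape)
        noise /= np.linalg.norm(noise, axis=1, keepdims=True)  # unit length per node
        H = H + eps * np.sign(H) * noise  # keep the noise in the embedding's orthant
        layers.append(H)
    # The input embeddings H0 are skipped, as stated in the text.
    return np.mean(layers, axis=0)


# Two views of the same nodes differ only through independent noise draws.
rng = np.random.default_rng(0)
A_norm = np.array([[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]])
H0 = rng.normal(size=(3, 4))
view_1 = perturbed_view(A_norm, H0, num_layers=2, eps=0.01, rng=rng)
view_2 = perturbed_view(A_norm, H0, num_layers=2, eps=0.01, rng=rng)
print(np.abs(view_1 - view_2).max())
```

Because the graph itself is never modified, producing a second view costs only another pass through the propagation loop.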
Model training. We leverage a multi-task training strategy to jointly optimize the semi-supervised learning task and the self-supervised learning task, as defined in Eq (18):
\[ L = L_{main} + \lambda L_{aux} \qquad (18) \]
where $L_{main}$ refers to the training loss of the main task, $L_{aux}$ refers to the training loss of the auxiliary task, and λ is a parameter that determines the magnitude of the contrastive learning task.
The host prediction for viruses serves as the main task, while the contrastive learning task is utilized as an auxiliary approach to augment data. The trainable parameters of the graph model are limited to the embeddings of viruses and prokaryotes in the 0-th layer. To optimize these parameters, previous approaches typically treat it as a supervised learning task, where the supervision signal is derived from the observed interactions or the linked edges in the graph.
We utilize the Bayesian Personalized Ranking (BPR) loss with negative sampling to optimize the semi-supervised learning task. The BPR loss is commonly used in collaborative filtering (CF) models, as it encourages the prediction scores of observed interactions to be higher than those of unobserved pairs. Here $\hat{y}_{vp} = h_v^{T} h_p$ is the preference score, $\sigma(\cdot)$ is the sigmoid function, and q is a prokaryote node sampled from the unobserved connections of virus node v.
For the self-supervised auxiliary task, we do not require additional label annotations. Instead, we utilize the two views of nodes in the heterogeneous graph. For the same node, we treat the views as positive pairs, while negative pairs are formed with different nodes. The positive pairs encourage consistency between different perturbed views of the same node, while the negative pairs strengthen the distinction between different nodes. Following the approach proposed in [28], we adopt the InfoNCE loss [29] to optimize the self-supervised augmentation task. This involves minimizing the agreement of negative pairs and maximizing the agreement of positive pairs. In this loss, N represents the node set of the heterogeneous graph, and ρ > 0 is the temperature hyperparameter.
We optimize the model using the Adam optimizer [30] with a learning rate of 0.001. Following [28], we choose the temperature ρ = 0.2, which has been reported as the optimal hyperparameter for the contrastive augmentation task.
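The two losses and their combination can be sketched as follows, using the standard textbook forms of BPR and InfoNCE. Several details are assumptions rather than statements about the authors' code: the losses are averaged over the batch, the views are L2-normalized before the similarity is taken, and the function names are hypothetical.

```python
import numpy as np


def bpr_loss(h_v: np.ndarray, h_pos: np.ndarray, h_neg: np.ndarray) -> float:
    """BPR loss for rows of (virus, observed host, sampled unobserved prokaryote)."""
    pos = np.sum(h_v * h_pos, axis=1)  # preference score of the observed pair
    neg = np.sum(h_v * h_neg, axis=1)  # preference score of the sampled negative
    # -log(sigmoid(x)) equals log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -(pos - neg))))


def info_nce_loss(view_1: np.ndarray, view_2: np.ndarray, rho: float = 0.2) -> float:
    """InfoNCE loss: row i of both views is a positive pair, other rows are negatives."""
    z1 = view_1 / np.linalg.norm(view_1, axis=1, keepdims=True)
    z2 = view_2 / np.linalg.norm(view_2, axis=1, keepdims=True)
    logits = z1 @ z2.T / rho
    shift = logits.max(axis=1, keepdims=True)  # log-sum-exp stabilization
    log_denominator = shift[:, 0] + np.log(np.exp(logits - shift).sum(axis=1))
    return float(np.mean(log_denominator - np.diag(logits)))


def joint_loss(main: float, aux: float, lam: float) -> float:
    """Multi-task objective: main loss plus lambda times the auxiliary loss."""
    return main + lam * aux


rng = np.random.default_rng(0)
h_v, h_pos, h_neg = (rng.normal(size=(8, 16)) for _ in range(3))
view_1, view_2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print(joint_loss(bpr_loss(h_v, h_pos, h_neg), info_nce_loss(view_1, view_2), lam=0.1))
```

In a real training loop these would be written in an automatic differentiation framework so that the 0-th layer embeddings can be updated with Adam; NumPy is used here only to show the arithmetic.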
Once the model is trained, each node in the heterogeneous graph is assigned a final embedding. These embeddings capture the representations of the nodes in a lower-dimensional space, encoding their characteristics and relationships. To predict virus-host interactions, we calculate the scores between the test virus nodes and all prokaryote nodes using a dot product operation, as described in the model training section. These scores reflect the similarity or affinity between the viruses and prokaryotes. By sorting the scores in descending order, we can identify the top scoring pairs, which correspond to the most likely virus-host interactions.
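The scoring and ranking step described above amounts to a matrix-vector product followed by a sort. The sketch below shows this for a single virus; the function name, the toy embeddings, and the placeholder host labels are hypothetical.

```python
import numpy as np


def rank_hosts(h_virus: np.ndarray, H_prok: np.ndarray, prok_names: list, k: int = 5) -> list:
    """Return the k prokaryotes with the highest dot-product score for one virus."""
    scores = H_prok @ h_virus                        # one score per prokaryote
    order = np.argsort(-scores, kind="stable")[:k]   # descending, ties keep input order
    return [(prok_names[i], float(scores[i])) for i in order]


# Toy embeddings for three candidate hosts and one query virus.
H_prok = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_virus = np.array([2.0, 1.0])
print(rank_hosts(h_virus, H_prok, ["host_A", "host_B", "host_C"], k=2))
# [('host_C', 3.0), ('host_A', 2.0)]
```

Taking only the first entry gives a single-host prediction, while keeping the top k entries gives a candidate host range.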

Evaluation criteria
To compare our model with state-of-the-art methods, we use accuracy as the evaluation metric, as defined in Eq (21):

\[ \mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of prediction samples}} \qquad (21) \]
The host prediction for a specific virus is counted as correct if the taxon of the predicted prokaryote is the same as that of the known interaction.
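A small sketch of how this accuracy could be computed at a chosen taxonomic rank is given below. Representing each host as a dictionary from rank to taxon name is an assumption made for illustration, as are the function name and the example records.

```python
def accuracy_at_rank(predicted: list, known: list, rank: str) -> float:
    """Fraction of viruses whose predicted host matches the known host at the given rank."""
    if len(predicted) != len(known):
        raise ValueError("predicted and known must have the same length")
    if not known:
        raise ValueError("at least one prediction is required")
    correct = sum(p[rank] == t[rank] for p, t in zip(predicted, known))
    return correct / len(known)


ecoli = {"species": "Escherichia coli", "genus": "Escherichia", "family": "Enterobacteriaceae"}
salmonella = {"species": "Salmonella enterica", "genus": "Salmonella", "family": "Enterobacteriaceae"}

predicted = [ecoli, salmonella]
known = [ecoli, ecoli]
for rank in ("species", "genus", "family"):
    print(rank, accuracy_at_rank(predicted, known, rank))
```

In this toy case the second prediction is wrong at the species and genus levels but correct at the family level, which is why accuracy tends to be higher at higher ranks.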

Performance comparison of different neural network-based methodologies
In the host prediction experiment, we use the CHERRY dataset mentioned above to evaluate the performance of our model and compare it with other state-of-the-art host prediction tools, including PHIST [17], PHP [18], DeepHost [20], VHM-net [21] and CHERRY [23], at taxonomic levels ranging from species to family. For DeepHost and PHP, we retrain the models using the hyperparameters given in their respective papers to adapt them to the datasets. For CHERRY, we reconstruct the knowledge graph and retrain the model. As for the other learning methods, since they are difficult to retrain, we use their pre-trained models to evaluate their performance. As shown in Fig 2, our approach outperforms other state-of-the-art methods on the CHERRY dataset, from family level to species level. On the CHERRY dataset, PHPGCA outperforms the second-best method, CHERRY, by a margin of 2%. It is observed that the prediction accuracy improves as we move from species to family level, as features with more information from higher taxonomic rankings are easier to distinguish.
Performance on different similarity between training and testing set. We employ the Dashing [31] algorithm to compute the similarity between the training set and the testing set. For each test virus, we calculate its similarity to all the training viruses and select the maximum similarity value as an indicator of how closely the test virus resembles the training set. Subsequently, we segment the test viruses into different similarity threshold categories and evaluate accuracy at the species level. Fig 2B depicts the outcomes, revealing that as the similarity between the training and testing sets increases, the performance of all methods exhibits an upward trend. Importantly, PHPGCA continues to demonstrate superior performance in comparison to other methods.
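The stratified evaluation above can be sketched as follows, assuming the pairwise similarities have already been computed (for example with Dashing) and stored as a test-by-train matrix. Treating each category as "maximum similarity at most t" is an assumption, since the text does not state whether the categories are cumulative or disjoint; the function name and the numbers are hypothetical.

```python
import numpy as np


def accuracy_by_similarity(similarity: np.ndarray, correct: np.ndarray, thresholds: list) -> dict:
    """Species-level accuracy for test viruses whose best training match is at most t.

    similarity: matrix of shape (num_test, num_train) with precomputed similarities.
    correct:    boolean vector, True where the species-level prediction was right.
    """
    correct = np.asarray(correct, dtype=bool)
    max_similarity = similarity.max(axis=1)  # closest training virus per test virus
    results = {}
    for t in thresholds:
        selected = max_similarity <= t
        results[t] = float(correct[selected].mean()) if selected.any() else float("nan")
    return results


similarity = np.array([[0.2, 0.5], [0.9, 0.1], [0.3, 0.3]])
correct = np.array([True, True, False])
print(accuracy_by_similarity(similarity, correct, thresholds=[0.4, 0.6, 1.0]))
```

Disjoint bins could be obtained instead by selecting on a lower and an upper bound per category.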
Performance on different length contigs. We first create contigs of varying lengths by randomly segmenting viral contigs into three specific lengths: 1 kbp, 2 kbp, and 5 kbp. For every test virus and each length category, we randomly cut out a contig of the specified length. This process is repeated ten times to ensure robustness. For the 1 kbp and 2 kbp conditions, a total of 6150 contigs is generated, whereas the 5 kbp condition yields 6140 contigs because certain virus sequences in the testing set are too short to meet the 5 kbp criterion. We then compare our method with PHIST, DeepHost, CHERRY, and VHM-net, all of which can predict hosts at the species level. The outcomes, illustrated in

Performance comparison on human gut phage dataset
Bacteriophages are plentiful in the human gut. As predators of bacteria, phages have a significant impact on both the composition and function of the human gut microbiome. Furthermore, the use of phage isolates targeted towards gut bacteria has shown promise as a powerful tool for manipulating the microbiome. In several cases [32][33][34], the application of specific phages has proven effective in eliminating pathogens, leading to favorable outcomes. These successful interventions highlight the feasibility of utilizing phages for engineering the gut microbiome. Nevertheless, the lack of universally applicable marker genes and the significant sequence variation observed among phage genomes pose challenges. Consequently, a large proportion of potential bacterial phage sequences identified from metagenomic data obtained from the human gut (ranging from 75% to 99%) cannot be taxonomically classified or associated with specific microbial hosts [35].
In a recent publication, Shen et al. [36] introduced a comprehensive collection of gut phage isolates known as the Gut Phage Isolate Collection (GPIC). This collection was established using the soft agar overlay method, which enabled the successful isolation and purification of phages. Remarkably, the authors were able to isolate phages that specifically targeted 42 different bacterial species found in the human gut. These bacterial species encompassed a diverse range, including 15 species from the Bacteroidetes phylum, 19 species from Firmicutes, 4 species from Actinobacteria, and 4 species from Proteobacteria. This achievement highlights the significant progress made in understanding the phage-bacteria interactions within the human gut ecosystem. Combining the prokaryotes from the supplementary data with those in the CHERRY database, we extracted 144 phages to validate the performance of our model. Fig 3A shows the distribution of phage target hosts at the species level. We compared the performance of PHPGCA with state-of-the-art methods using the 144 phages extracted from GPIC, and found that PHPGCA still outperformed other methods. In fact, it achieved 6%, 9%, and 3% higher accuracy than the second-best method, CHERRY, at the species, genus, and family levels, respectively.

Host prediction on metagenomic data
In this section, we validate the performance of our model in host prediction for putatively novel viruses identified from metagenomic data. Since metagenomic data can contain numerous distinct species or components, it is essential to use prokaryotic virus identification tools to screen viral contigs from metagenomic data before applying our model for host prediction. For example, widely used tools such as metaviralSPAdes [37], Seeker [38], and VirSorter [39] can be employed for virus identification.
In our experiment, we utilize MetaHiC [40], a method that detects interactions between phages and assembled bacterial genomes in human gut samples. MetaHiC captures DNA-DNA collisions that occur during phage replication inside bacterial cells, providing a high-quality benchmark for host prediction. We use the phage-bacteria interactions provided by MetaHiC as the ground truth to evaluate the performance of current state-of-the-art methods at the species level. To obtain the necessary data for our evaluation, we use the supplementary data from MetaHiC, which provides bins or contigs and their corresponding taxonomy information. However, it should be noted that only 62 bins have species taxon annotations available. Therefore, we extract contigs from these bins to train and test our model. As shown in Fig 4, our approach remains competitive on metagenomic data compared to the state-of-the-art methods. PHPGCA achieved an accuracy of 46%, surpassing PHIST by 2%.

Multi-host prediction
Determining whether phages have the ability to infect multiple hosts is important for various reasons, including guiding phage therapy, aiding bacterial identification, understanding phage ecology and evolution, and facilitating biotechnological applications. It provides valuable information for the field of phage research and has practical applications in various fields, including medicine, microbiology, ecology, and food safety.
Case study one: Multi-host prediction of bacteriophage phi92. Bacteriophage phi92 is a large, lytic myovirus that was initially isolated from pathogenic Escherichia coli strains carrying a polysialic acid capsule in 1983. However, further investigation showed that its host range is not limited to polysialic acid-encapsulated E. coli strains, but also includes various laboratory strains of E. coli and many Salmonella strains. In a study by Schwarzer et al. [41], the host specificity of phi92 was re-examined by testing it on multiple bacterial strains, including laboratory strains of E. coli and a wide range of Salmonella strains. The results demonstrated that bacteriophage phi92 can infect both E. coli and Salmonella strains, as evidenced by their respective plating efficiencies.
Fig 5 illustrates the second-order neighbors of bacteriophage phi92. In this network, the neighbors carry various labels, encompassing not only the true hosts that phi92 infects but also other labels, represented by the grey-colored nodes. Predicting the accurate label for phi92 using conventional label propagation methods based on the heterogeneous graph can be challenging. However, PHPGCA demonstrates the capability to accurately predict the host of phi92 even in this scenario. We use PHPGCA to identify potential hosts infected by bacteriophage phi92. The prediction scores were sorted in descending order, and the top 5 scores were considered as the predicted host range. Table 1 displays the top 5 predictions obtained from our analysis. The outcomes of our analysis reveal that PHPGCA successfully predicted Escherichia coli and Salmonella enterica as the potential hosts for bacteriophage phi92.
Case study two: Multi-host prediction of phages on food safety. Globally, there are approximately 600 million reported cases of foodborne illnesses each year, resulting in the loss of 420,000 lives [42]. Among the various pathogens responsible for foodborne illnesses, Escherichia coli, Salmonella enterica and Shigella flexneri are recognized as significant contributors to these outbreaks. These bacterial species are widely acknowledged as major causes of foodborne illnesses, posing a significant threat to public health and food safety worldwide. Phages, characterized by their high diversity, have the unique capability to infect and lyse host bacteria, leading to the release of progeny phages that can initiate subsequent infections. This ability of phages to target and destroy bacterial pathogens has been harnessed and exploited for food safety purposes. The antimicrobial activities of phages have proven valuable in mitigating the risk of foodborne illnesses caused by bacteria such as Escherichia coli and Salmonella enterica. By specifically targeting and eliminating these bacterial pathogens, phages offer a promising approach to enhance food safety and reduce the incidence of foodborne illnesses. We collect three phages, namely HY01 [43], EscoHU1 [44] and LPEK22 [45], that have demonstrated potential in inhibiting the growth of Escherichia coli, Salmonella enterica and Shigella flexneri in food. Table 2 presents the host range and the strain used for isolating the three phages. We have utilized PHPGCA to predict the potential host range of HY01, EscoHU1, and LPEK22. Table 3 presents the top 5 prediction scores obtained from PHPGCA, arranged in descending order. The results demonstrate that PHPGCA successfully predicted the multiple species targeted by the three phages, as all of them are included in the predictions. This highlights the accurate predictive capabilities of PHPGCA for determining the potential hosts of phages infecting multiple species.

Ablation study
In our ablation studies, we investigate the impact of different graph convolutional operations and the effectiveness of contrastive augmentation. As shown in Fig 6, our evaluation encompasses several components, including the use of different GCN models and the incorporation of contrastive learning. We perform comparisons involving LightGCN, the graph neural network architecture employed in our proposed model, and GCN, a commonly used graph neural network. Additionally, we assess the performance of LightGCN and GCN when combined with contrastive augmentation.
The experimental results show that the best performance is achieved by the LightGCN graph architecture with contrastive augmentation. This is followed by the combination of GCN with contrastive augmentation. The results suggest that contrastive augmentation enhances the prediction ability of both LightGCN and GCN, without requiring additional annotation data. Without augmentation, the performance of LightGCN and GCN is inferior. Overall, the key components of our model, LightGCN and contrastive augmentation, both contribute to improving the prediction ability in the host prediction task.

The impact of hyperparameters
In this section, we conduct experiments with different hyperparameters to examine their impact on the performance of our proposed model.
We find that over the first few layers, performance improves because the representation of each node is aggregated more fully, capturing features of the global graph structure. However, setting K too large leads to the oversmoothing problem: the nodes of the graph aggregate information from their neighbors too many times, and information from different parts of the graph becomes indistinguishable.
For the contrastive learning strength λ, the auxiliary contrastive learning task starts to take effect as λ first increases. The range over which the model maintains peak performance varies across datasets. However, when λ increases further, accuracy decreases and can even fall below that of the model without augmentation.
4. Magnitude of noise ε. We vary the value of ε from 0 to 5, as shown in Fig 7D. As with λ, model accuracy increases at first but decreases if the value is set too large. When ε is near 0.01, the model achieves its best performance on the CHERRY dataset.
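The oversmoothing effect described for large K can be illustrated with a toy example. The sketch below is an assumption-laden stand-in: it uses plain mean aggregation with self-loops on a five-node path graph, not the LightGCN propagation rule used by PHPGCA, and the graph, features, and function names are all hypothetical. It tracks how the spread between node features shrinks as more layers are stacked.

```python
from typing import Dict, List


def propagate(features: List[float], neighbors: Dict[int, List[int]]) -> List[float]:
    """One round of mean aggregation over each node and its neighbors."""
    updated = []
    for node in range(len(features)):
        group = [node] + neighbors[node]  # include the node itself (self-loop)
        updated.append(sum(features[n] for n in group) / len(group))
    return updated


def spread(features: List[float]) -> float:
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Toy path graph 0-1-2-3-4 with one scalar feature per node (illustrative only).
toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
toy_features = [1.0, 0.0, 0.0, 0.0, -1.0]

# spreads[k] is the feature spread after k propagation layers.
spreads = [spread(toy_features)]
current = toy_features
for _ in range(10):
    current = propagate(current, toy_neighbors)
    spreads.append(spread(current))
```

Because each update is an average, the spread can never grow, and after enough layers the nodes become nearly indistinguishable, which is consistent with accuracy dropping once K is set too large.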

Discussion
In this study, we propose a novel approach for prokaryotic virus host prediction, in which we treat the host prediction task as a recommendation task. We start by constructing a virus-prokaryote heterogeneous graph that integrates abundant information, such as protein similarity between viruses and sequence similarity between viruses and prokaryotes. Subsequently, we employ the graph encoder LightGCN to encode the embeddings of the nodes within the graph. Additionally, we employ a graph contrastive learning method to augment the node representations without requiring additional labels. This is achieved by encouraging representations of the same node in two different views to be similar, while pushing representations of different nodes in those views apart. Lastly, we employ a multi-task training strategy to optimize the model for the host prediction task. The main task focuses on predicting the host, while the auxiliary task is contrastive learning.
The experimental results demonstrate that PHPGCA outperforms state-of-the-art methods on three different datasets. Moreover, when applied to metagenomic data, PHPGCA remains competitive with the current state-of-the-art methods. By utilizing PHPGCA, we can calculate preference scores between viruses and prokaryotes, enabling us to effectively assess the host range of newly discovered phages. We further validate our approach through two case studies, showcasing the effectiveness of PHPGCA in determining the host range of these phages.
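The contrastive objective and the multi-task combination described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the PHPGCA training code: the InfoNCE form, cosine similarity, the temperature value, and the names `info_nce` and `joint_loss` are all assumptions introduced here; only the idea of weighting the auxiliary contrastive loss by λ comes from the text.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view_a: List[Vector], view_b: List[Vector], temperature: float = 0.2) -> float:
    """Mean InfoNCE loss: node i in view A should match node i in view B."""
    total = 0.0
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive pair
    return total / len(view_a)


def joint_loss(main_loss: float, contrastive_loss: float, lam: float) -> float:
    """Multi-task objective: main host-prediction loss plus lam times the auxiliary loss."""
    return main_loss + lam * contrastive_loss


# Two hypothetical views of two nodes: one consistent pairing and one swapped pairing.
aligned_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]
swapped_b = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = info_nce(aligned_a, aligned_b)
loss_swapped = info_nce(aligned_a, swapped_b)
```

In this toy setup the loss is low when each node's two views agree and high when they are swapped. With `lam = 0` the joint objective reduces to the main task alone, which corresponds to training without augmentation.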
Despite the performance improvement of PHPGCA in host prediction, there are areas that can be optimized in future work. Firstly, the construction of an accurate and informative virus-prokaryote heterogeneous graph is crucial for the contrastive learning framework. Currently, only virus-virus and virus-prokaryote interactions are utilized, without incorporating prokaryote-prokaryote interactions. Exploring the incorporation of additional biological features in the graph construction could be a potential avenue for further improvement. Secondly, the generation of different views of nodes for contrastive learning is based on simple strategies. Exploring more powerful data augmentation methods may unlock more potential of unlabeled data in improving the performance of PHPGCA in host prediction.

Fig 1. Overall framework illustration of our proposed PHPGCA. (A) Construction of the heterogeneous graph, containing virus-virus and virus-prokaryote edge construction. (B) The architecture of Light Graph neural network with contrastive augmentation. (C) Model training and host range prediction by ranking the scores in descending order. https://doi.org/10.1371/journal.pcbi.1011671.g001
$N_v^{v2v}$ indicates the neighbouring viral nodes connected with $v$. Similarly, $N_v^{v2p}$ refers to the neighbouring prokaryotic nodes connected with $v$, and $N_p^{p2v}$ represents the neighbouring viral nodes connected with $p$. The normalization term Norm helps to prevent the scale of the node embeddings from growing under repeated graph convolution operations. Due to the difference in the layer combination,

v
are the different noise vectors. The noise vectors assigned to individual nodes and layers are distinct, aiming to generate diverse noise and subsequently employ contrastive learning to enhance the robustness of node embeddings. The modulus of the noise vectors, $\|\psi\|_2 = \epsilon$, controls the intensity.
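A noise vector with a fixed L2 norm ε can be generated by sampling a random direction and rescaling it. The sketch below assumes a Gaussian-sampled direction and purely additive noise, neither of which is confirmed by the text; the function names `noise_vector` and `perturb` and the example embedding are hypothetical.

```python
import math
import random
from typing import List


def noise_vector(dim: int, epsilon: float, rng: random.Random) -> List[float]:
    """Sample a random direction and rescale it so its L2 norm equals epsilon."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [epsilon * x / norm for x in raw]


def perturb(embedding: List[float], epsilon: float, rng: random.Random) -> List[float]:
    """Add a freshly sampled noise vector of norm epsilon to an embedding."""
    noise = noise_vector(len(embedding), epsilon, rng)
    return [e + n for e, n in zip(embedding, noise)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
embedding = [0.5, -0.2, 0.1, 0.7]  # hypothetical node embedding

# Two independently perturbed views of the same node, as used for contrastive learning.
view_one = perturb(embedding, epsilon=0.01, rng=rng)
view_two = perturb(embedding, epsilon=0.01, rng=rng)
```

Each call draws a new noise vector, so the two views differ from each other while both stay within distance ε of the original embedding.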
The results in Fig 2C indicate a discernible trend wherein the performance of all methods improves as the contig length increases. Remarkably, PHPGCA continues to outperform other methods across various contig lengths.

Fig 2. Performance on the CHERRY dataset. (A) The results on the CHERRY dataset, from the species level to the family level, are presented from left to right. (B) Performance at the species level for different similarities between the training set and the testing set. X-axis: Dashing similarity threshold. Left Y-axis: number of test viruses under the given threshold. Right Y-axis: accuracy at the species level. (C) Performance evaluation of contigs of varying lengths at the species level. X-axis: length of contigs. Y-axis: accuracy at the species level. https://doi.org/10.1371/journal.pcbi.1011671.g002

Fig 3. The experimental results on the GPIC dataset. (A) The distribution of different species. (B) Performance of host prediction compared with the state-of-the-art methods from family to species level. https://doi.org/10.1371/journal.pcbi.1011671.g003

Fig 4. The description of the Hi-C sequencing dataset and host prediction performance. (A) The number of bins corresponding to different species. (B) The number of training contigs corresponding to different species. (C) The number of testing contigs corresponding to different species. (D) The performance of host prediction on testing data. https://doi.org/10.1371/journal.pcbi.1011671.g004

Fig 5. Visualization of the graph of second-order neighbors of bacteriophage phi92. The nodes are colored based on their respective labels. For prokaryotic nodes, the labels represent their species, whereas for virus nodes, the labels represent their hosts' species. In the visualization, the top 5 predicted species are shown in distinct colors, while all other nodes are shown in gray. https://doi.org/10.1371/journal.pcbi.1011671.g005

3. Strength of CL λ. As shown in Fig 7C, the prediction performance on all the datasets increases at first as λ increases. This phenomenon is expected because the

Table 3. Top 5 host predictions of the three phages.
https://doi.org/10.1371/journal.pcbi.1011671.t003

1. Embedding size d. For the CHERRY dataset, the prediction accuracy increases from d = 16 to d = 128. When the embedding size is set small, the model may not be able to represent the full information of the input data. The model achieves its best performance on the CHERRY dataset when d = 128, which appears to be an appropriate embedding size. However, the prediction ability may decrease if the embedding size is set too large, causing overfitting and reducing model generalization.
2. Number of layers K. Fixing d = 128, we vary K from 1 to 10. As shown in Fig 7B, the prediction accuracy is affected by the choice of K. For the CHERRY dataset, K = 3 performs best.
The hyperparameters include the embedding size d, the number of layers K, and the augmentation hyperparameters λ and ε. Specifically, we fixed other hyperparameters and experimented with different